Pattern Discovery from Biosequences
Author
Abstract
In this thesis we have developed novel methods for analyzing biological data: the primary sequences of DNA and proteins, microarray-based gene expression data, and other functional genomics data. The main contribution is the pattern discovery algorithm SPEXS, accompanied by several practical applications to real biological problems. To perform these studies, which integrate different types of biological data, we have developed Expression Profiler (http://ep.ebi.ac.uk/), a comprehensive web-based environment for biological data analysis.
Biosequences, i.e., the primary sequences of DNA, RNA, and protein molecules, represent the most basic type of biological information. Features of these sequences that are reused by nature help us better understand the basic mechanisms of gene structure, function, and regulation. SPEXS has been developed to discover biologically relevant features that can be represented as sequence patterns. It is a fast exhaustive search algorithm for the class of generalized regular patterns. This class is essentially the same as the one used in the PROSITE pattern database, i.e., it allows patterns to consist of fixed character positions, group character positions (ambiguities), and wildcards of variable length.
The biological relevance of the patterns can be estimated according to several different mathematical criteria, which have to be chosen to suit the application. We have used SPEXS to analyze real biological problems and have found biologically meaningful patterns in a variety of applications. For example, we have studied gene regulation mechanisms by a systematic
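To make the pattern class described in the abstract concrete, the following minimal sketch translates a simplified PROSITE-style pattern (fixed characters, group characters in square brackets, and `x` wildcards with fixed or variable length) into a Python regular expression and reports where it matches. It is not SPEXS and does not discover patterns; it only matches a given one. The function names `prosite_to_regex` and `find_matches`, the supported syntax subset, and the example sequences are assumptions made for illustration.

```python
import re

# One pattern element: a wildcard "x", a fixed letter, or a [group],
# optionally followed by a repeat count "(n)" or range "(m,n)".
# This is a simplified subset of PROSITE syntax, assumed for illustration.
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a regex string."""
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        symbol, low, high = m.groups()
        regex = "." if symbol == "x" else symbol  # wildcard matches any character
        if low is not None:
            # "(m,n)" becomes a variable-length repeat, "(n)" a fixed-length one
            regex += f"{{{low},{high}}}" if high is not None else f"{{{low}}}"
        parts.append(regex)
    return "".join(parts)


def find_matches(pattern: str, sequences: dict) -> dict:
    """Return the start positions of non-overlapping matches in each sequence."""
    compiled = re.compile(prosite_to_regex(pattern))
    return {
        name: [m.start() for m in compiled.finditer(seq)]
        for name, seq in sequences.items()
    }


# Hypothetical toy sequences, not real biological data.
example_sequences = {"seq1": "AACGGEHTT", "seq2": "CAAAADH", "seq3": "CDH"}
example_hits = find_matches("C-x(2,4)-[DE]-H", example_sequences)
```

Here the pattern `C-x(2,4)-[DE]-H` reads as: a fixed `C`, then two to four arbitrary characters, then `D` or `E`, then a fixed `H`. A discovery algorithm has the inverse task of searching the space of such patterns for those that score well against a set of sequences.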
Similar resources
Knowledge Discovery in Biosequences Using Sort Regular Patterns
This paper considers knowledge discovery by sort regular patterns, which are strings over sort letters representing finite sets of basic letters. We devise a learning algorithm for the class based on the minimal multiple generalization technique, and evaluate the method by experiments on biosequences from the GenBank database. The experiments show that a relatively simple sort pattern can represent a...
Unsupervised Pattern Discovery in Biosequences Using Aligned Pattern Clustering
Protein, RNA and DNA are made up of sequences of amino acids/nucleotides, which interact among themselves via binding. For example, (1) protein-DNA binding regulates gene transcription [1]; and (2) protein-protein binding plays important roles in cell cycle control and signal transduction [2]. The binding is maintained by either the direct participation or assistance of conserved short segments ...
Reports in Informatics Approaches to the Automatic Discovery of Patterns in Biosequences
This paper is a survey of approaches and algorithms used for the automatic discovery of patterns in biosequences. Patterns with the expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering those patterns which are the most frequently...
Reports in Informatics Relation Patterns and Their Automatic Discovery in Biosequences
We have extended the pattern language used in PROSITE to enable it to describe dependencies between amino acid residues. We have developed a fitness measure, based on the minimum description length principle, that evaluates the significance of such patterns in relation to a set of sequences, and an algorithm that automatically finds significant patterns in unaligned sequences. Computing experiments are reported show...
Approaches to the Automatic Discovery of Patterns in Biosequences
This paper surveys approaches to the discovery of patterns in biosequences and places these approaches within a formal framework that systematises the types of patterns and the discovery algorithms. Patterns with expressive power in the class of regular languages are considered, and a classification of pattern languages in this class is developed, covering the patterns that are the most frequen...
Discovering Patterns and Subfamilies in Biosequences
We consider the problem of automatic discovery of patterns and the corresponding subfamilies in a set of biosequences. The sequences are unaligned and may contain noise of unknown level. The patterns are of the type used in the PROSITE database. In our approach we discover patterns and the respective subfamilies simultaneously. We develop a theoretically substantiated significance measure for a set...